Revise Process to Support the Updated Format

Last Updated: May 19th, 2026

Warning2023 vs. 2026 Format

In May 2026, an updated version of the raw data was provided in a different format than the version exported in July 2023 and provided in the summer of 2025. As a result, data processing was split into two paths: one for the 2023 format and one for the 2026 format.

The 2026 versions of comparable sections should be treated as the most current method applicable for ongoing research. Insights from the 2023 format are assumed to apply to the 2026 format as well. Unexpected outcomes or shortcomings in algorithm performance identified in the 2023 drafts were used to inform improvements applied in the 2026 version.

The purpose of this page is to detail the differences between the 2023 and 2026 formats and provide a framework for handling the additional metadata. The insights detailed in Review - 2023 Format are assumed to translate to this format, as the original data source is the same. In addition to the new metadata, the data is also structured in a long format, where additional years are represented as rows rather than columns. This report details how these nuances impact the data transformation and how that information is captured when translated into a years-by-column format.

Selected code excerpts are provided below. The complete code used to generate this report can be found in Code/Process Data Update.R. Functions referenced can be found in Code/Support Functions/For Processing New Data.R.

Raw Data Dictionary

File name: church_long_form_050926.csv
Number of rows: 10,201,197
Number of columns: 54

  • company: The business name.

  • archive_version_year: The year the information was added, spanning 2000 to 2025.

  • abi: American Business Identifier (ABI): The unique identifier for each business, used to associate addresses with a specific organization.

  • year_established: Indicates the year the business was first established.

  • primary_naics8_code: This dataset combines the six-digit 2022 vintage North American Industry Classification System (NAICS) code and a two-digit proprietary code from Data Axle. The two-digit Data Axle code provides additional classification details not reflected here.

  • naics8_descriptions: All entries have NAICS = 813110, indicating they are classified as “Religious Organizations”, which includes churches, shrines, monasteries (except schools), synagogues, mosques, and temples.

  • subsidiary_number: A unique identifier assigned to a subsidiary of the parent company, used to link the business location back to its specific subsidiary entity within the broader corporate hierarchy.

  • company_holding_status: Indicates the holding or ownership status of the company, distinguishing between classifications such as headquarters, branch, subsidiary, or standalone entity within a corporate structure.

  • address_line_1: The primary address line associated with the ABI entry.

  • city: The city associated with the address.

  • state: The state associated with the address.

  • zipcode: The five-digit ZIP code associated with the address.

  • zip4: The four-digit ZIP code extension associated with the address.

  • primary_sic_code: The primary Standard Industrial Classification (SIC) code assigned to the business, representing its main industry or business activity.

  • sic6_descriptions: Description of the SIC code.

  • sic_code_[1-4]: Up to four additional SIC codes assigned to the business, capturing additional or secondary industries and activities beyond the primary classification.

  • sic6_descriptions_sic[1-4]: Description of the addditional SIC codes.

  • longitude and latitude: Given as two columns. Provides the geolocation associated with the address listed in that row.

  • census_block: The smallest geographic unit defined by the U.S. Census Bureau. Provides a precise sub-tract level location identifier associated with the address listed in that row.

  • census_tract: A small, relatively permanent geographic subdivision of a county defined by the U.S. Census Bureau, used for statistical and demographic analysis.

  • county_code: A numeric code identifying the county associated with the address listed in that row.

  • fips_code: The Federal Information Processing Standards (FIPS) code assigned to the geographic area, providing a standardized identifier for states, counties, and other administrative regions.

  • cbsa_level: Indicates the level of the Core-Based Statistical Area (CBSA), distinguishing between Metropolitan Statistical Areas (MSAs) and Micropolitan Statistical Areas.

  • cbsa_code: A numeric code identifying the Core-Based Statistical Area (CBSA) associated with the address, as defined by the U.S. Office of Management and Budget (OMB).

  • csa_code: A numeric code identifying the Combined Statistical Area (CSA) associated with the address, representing a grouping of adjacent CBSAs with strong economic and social ties.

  • area_code: The telephone area code associated with the business location.

  • site_number: A unique identifier assigned to a specific physical location or site of a business, used to distinguish individual locations within the same organization.

  • yellow_page_code: A code indicating the Yellow Pages heading(s) under which the business is listed, reflecting how the business advertises or categorizes itself in directory listings.

  • office_size_code: A coded value representing the size of the office or physical location.

  • employee_size_location and location_employee_size_code: Given as two columns. Provides the actual employee count at the specific location and a corresponding code representing that count within a size classification range.

  • sales_volume_location and location_sales_volume_code: Given as two columns. Provides the actual sales attributed to the specific location and a corresponding code representing that number within a size classification range.

  • parent_number: A unique identifier assigned to the parent company of the business, used to link individual locations or subsidiaries back to their controlling organization.

  • parent_actual_employee_size and parent_employee_size_code: Given as two columns. Provides the total employee count for the parent company and a corresponding code representing that count within a size classification range.

  • parent_actual_sales_volume and parent_sales_volume_code: Given as two columns. Provides the total sales for the parent company and a corresponding code representing that number within a size classification range.

  • Unknown variables or code translations: address_type_indicator, industry_specific_first_byte, business_status_code, population_code, match_code, ticker, and idcode.

Correspondance to the 2023 Format

Core fields (abi, address_line_1, city, and state) are consistent between the old and new data. Not shown here, the 2026 format introduces a variable archive_version_year that appends new years of information as rows, whereas the 2023 format extended this information column-wise as binary columns. Notably, the new data stores the zip code as a five-digit character field, preserving flanking zeros, and includes a four-digit zip code extension.

                       Variable Metadata Var Class_2026 Var Class_2023
1                       company     TRUE      character           <NA>
2                           abi    FALSE        numeric        numeric
3              year_established     TRUE        numeric        numeric
4           primary_naics6_code     TRUE        numeric        numeric
5           naics6_descriptions     TRUE      character      character
6             subsidiary_number     TRUE      character           <NA>
7        company_holding_status     TRUE        numeric           <NA>
8                address_line_1    FALSE      character      character
9                          city    FALSE      character      character
10                        state    FALSE      character      character
11                      zipcode    FALSE      character        numeric
12                         zip4    FALSE      character           <NA>
13             primary_sic_code     TRUE        numeric           <NA>
14            sic6_descriptions     TRUE      character           <NA>
15                     sic_code     TRUE        numeric           <NA>
16        sic6_descriptions_sic     TRUE      character           <NA>
17                   sic_code_1     TRUE        numeric           <NA>
18       sic6_descriptions_sic1     TRUE      character           <NA>
19                   sic_code_2     TRUE        numeric           <NA>
20       sic6_descriptions_sic2     TRUE      character           <NA>
21                   sic_code_3     TRUE        numeric           <NA>
22       sic6_descriptions_sic3     TRUE      character           <NA>
23                   sic_code_4     TRUE        logical           <NA>
24       sic6_descriptions_sic4     TRUE        logical           <NA>
25                     latitude    FALSE        numeric        numeric
26                    longitude    FALSE        numeric        numeric
27                 census_block    FALSE        numeric           <NA>
28                 census_tract    FALSE      character           <NA>
29                  county_code    FALSE      character           <NA>
30                    fips_code    FALSE      character           <NA>
31                   cbsa_level    FALSE        numeric           <NA>
32                    cbsa_code    FALSE      character           <NA>
33                     csa_code    FALSE      character           <NA>
34                    area_code    FALSE        numeric           <NA>
35                  site_number     TRUE      character           <NA>
36             yellow_page_code     TRUE        numeric           <NA>
37             office_size_code     TRUE        logical           <NA>
38       employee_size_location     TRUE        numeric           <NA>
39  location_employee_size_code     TRUE      character           <NA>
40        sales_volume_location     TRUE        numeric           <NA>
41   location_sales_volume_code     TRUE      character           <NA>
42                parent_number     TRUE      character           <NA>
43  parent_actual_employee_size     TRUE        numeric           <NA>
44    parent_employee_size_code     TRUE      character           <NA>
45   parent_actual_sales_volume     TRUE        numeric           <NA>
46     parent_sales_volume_code     TRUE      character           <NA>
47          primary_naics2_code     TRUE        numeric        numeric
48       address_type_indicator     TRUE      character           <NA>
49 industry_specific_first_byte     TRUE      character           <NA>
50         business_status_code     TRUE        numeric           <NA>
51              population_code     TRUE        numeric           <NA>
52                   match_code     TRUE      character           <NA>
53                       ticker     TRUE        logical           <NA>
54                       idcode     TRUE        numeric           <NA>

The new dataset also introduces additional columns covering census and location attributes, as well as business-level details including size, status, and SIC encoding. Because the data structure appends new information row-wise, care must be taken to ensure these metadata fields are correctly collapsed when the data is transformed into single rows per ABI and address. Ideally, the structural transformation should occur after addresses have been grouped and matched to similar ones, to reduce redundancies introduced by typographical errors.

Variables can be categorized as follows:

                       Variable                                  Group
1                       company                  Core business details
2                           abi                  Core business details
3              year_established                  Core business details
4           primary_naics6_code                  Core business details
5           naics6_descriptions                  Core business details
6             subsidiary_number                  Core business details
7        company_holding_status                  Core business details
8                address_line_1      Core business details by location
9                          city      Core business details by location
10                        state      Core business details by location
11                      zipcode      Core business details by location
12                         zip4      Core business details by location
13             primary_sic_code      Core business details by location
14            sic6_descriptions      Core business details by location
15                     sic_code      Core business details by location
16        sic6_descriptions_sic      Core business details by location
17                   sic_code_1      Core business details by location
18       sic6_descriptions_sic1      Core business details by location
19                   sic_code_2      Core business details by location
20       sic6_descriptions_sic2      Core business details by location
21                   sic_code_3      Core business details by location
22       sic6_descriptions_sic3      Core business details by location
23                   sic_code_4      Core business details by location
24       sic6_descriptions_sic4      Core business details by location
25                     latitude                  Core location details
26                    longitude                  Core location details
27                 census_block                  Core location details
28                 census_tract                  Core location details
29                  county_code                  Core location details
30                    fips_code                  Core location details
31                   cbsa_level                  Core location details
32                    cbsa_code                  Core location details
33                     csa_code                  Core location details
34                    area_code                  Core location details
35                  site_number      Core business details by location
36             yellow_page_code      Core business details by location
37             office_size_code      Core business details by location
38       employee_size_location      Core business details by location
39  location_employee_size_code      Core business details by location
40        sales_volume_location      Core business details by location
41   location_sales_volume_code      Core business details by location
42                parent_number                 Parent company details
43  parent_actual_employee_size                 Parent company details
44    parent_employee_size_code                 Parent company details
45   parent_actual_sales_volume                 Parent company details
46     parent_sales_volume_code                 Parent company details
47          primary_naics2_code Unknown variables or code translations
48       address_type_indicator Unknown variables or code translations
49 industry_specific_first_byte Unknown variables or code translations
50         business_status_code Unknown variables or code translations
51              population_code Unknown variables or code translations
52                   match_code Unknown variables or code translations
53                       ticker Unknown variables or code translations
54                       idcode Unknown variables or code translations

By this classification, there are are two types of metadata:

  1. Metadata relevant to a given business or location that does not change across years (e.g. year_established, fips_code)

  2. Metadata that may change across years (e.g. census_block, employee_size_location)

This analysis requires only most of the core business and location details. The remaining metadata will be processed simultaneously to produce a final, cleaned dataset, though not with any specific follow-up analysis in mind. As a result, the metadata may not be in a format suitable for all analyses, as the data are primarily prepared for the dashboard.

Metadata are processed by considering change across location (distinguishing between values consistent for a given business and those that vary by location) and year. For both types of location variation, value changes by year are captured to maintain granularity. While these are not expected to indicate anything actionable, they may reveal opportunities for standardization or data cleaning and validation. Some of these opportunities will be taken here, while the remainder will be left for follow-up analyses.

To capture year-level granularity in value changes, a date range will be added to each unique value when collapsed by abi and combined_address (a formatted combination of all address components). For fields that are relevant to this project, unique values that do not span the expected years — the years in which that abi and address were recorded — will be cleaned to report only the distinct value.

In the following sections, specific metadata fields are assessed for quality and to develop this condensing approach.

NAICS and SIC Encodings

NAICS and SIC encodings provide additional context about the type of business, religious affiliation, and function. The 2023 form of the data only included the NAICS encoding, while the 2026 form includes both.

Primary NAICS and SIC Encodings

When processing the 2023 data, it was identified that the NAICS code is eight digits long, with the trailing two digits representing an unknown proprietary code from Data Axle. These will be separated into a six-digit (naics6) code with description and a two-digit (naics2) code with no associated description.

Both versions of the data share the same unique NAICS eight-digit encodings. The 2023 data processing protocol retained only the unique six-digit NAICS codes and their descriptions, which are consistent across all entries; the trailing two digits were stored as a string. The method devised to retain applicable years will also be applied to this field.

SIC refines the classification of religious organizations by capturing attributes such as affiliation, denomination, and operational function (e.g., school, retreat center). This information was not provided in the previous version. It is provided as both a primary classification column accompanied by up to four additional overflow columns. All are examined together in the following section.

Primary and Additional SIC Encodings

The following assumptions about the SIC encodings require review and confirmation:

  1. Overflow columns are filled progressively with no other discernible pattern.

  2. The same categories appear across all columns, with the possible exception of the primary SIC column, which may differ or be restricted.

  3. Each code maps to a unique description and vice versa.

  4. Nomenclature is consistent throughout.

Important

The following results apply to the most recent data export in the 2026 format at the time of this writing. To ensure subsequently exported data in this format are appropriately processed, they will need to be reaudited to confirm the validity of these assumptions or flag any corrections that are required.

Show the code
# SIC column auditing function devised to quickly assess the above assumptions
# for recently exported data in the 2026 format.

sic_overflow_audit <- function(df,
                               sic_code_cols = c("primary_sic_code","sic_code","sic_code_1","sic_code_2","sic_code_3","sic_code_4"),
                               sic_desc_cols = c("sic6_descriptions","sic6_descriptions_sic","sic6_descriptions_sic1",
                                                 "sic6_descriptions_sic2","sic6_descriptions_sic3","sic6_descriptions_sic4"),
                               id_cols = "abi") {
  #' @description
  #' Audits a dataset that stores SIC classifications in a primary column plus
  #' multiple “overflow” columns (e.g., `sic_code_1` ... `sic_code_4`) with a
  #' corresponding set of description columns. The function:
  #' (1) runs an overflow-pattern check on codes and descriptions,
  #' (2) identifies rows where the description overflow pattern fails (often due to
  #'     missing descriptions),
  #' (3) quantifies overlap in (code, desc) pairs across sources,
  #' (4) builds a unified (sic_code -> sic_desc) mapping table and checks one-to-one
  #'     consistency, and
  #' (5) identifies SIC codes that appear sometimes with NA descriptions and sometimes
  #'     with non-NA descriptions.
  #'
  #' @param df A data.frame/tibble containing SIC code and description columns.
  #' @param sic_code_cols Character vector of SIC code column names, ordered from
  #'   primary to most-overflowed.
  #' @param sic_desc_cols Character vector of SIC description column names aligned
  #'   1:1 with `sic_code_cols`.
  #' @param id_cols Optional character vector of identifier columns to retain when
  #'   returning problematic rows (default: `"abi"`). Only columns present in `df`
  #'   are used.
  #'
  #' @details
  #' This function assumes you have an `overflow_check(df, cols)` helper available
  #' in your environment that returns a data frame with (at least) an `overflow_ok`
  #' logical column describing whether each row follows the expected overflow pattern.
  #'
  #' @return A named list containing:
  #' \describe{
  #'   \item{sic_code_cols_used}{Aligned SIC code columns actually used (present in `df`).}
  #'   \item{sic_desc_cols_used}{Aligned SIC description columns actually used (present in `df`).}
  #'   \item{overflow_checked_code}{Row-level output of `overflow_check()` for code columns.}
  #'   \item{overflow_checked_desc}{Row-level output of `overflow_check()` for description columns.}
  #'   \item{overflow_summary_code}{Counts of rows that follow the progressive overflow pattern (OK vs not OK) for codes.}
  #'   \item{overflow_summary_desc}{Counts of rows that follow the progressive overflow pattern (OK vs not OK) for descriptions.}
  #'   \item{missing_desc_rows}{Subset of rows where description overflow check fails.}
  #'   \item{missing_desc_code}{Unique SIC codes that appear with NA descriptions among failing rows.}
  #'   \item{missing_desc_codes_are_subset_of_no_desc}{Logical check: confirm if all SIC codes correspond to NA description.}
  #'   \item{presence_wide}{Unique (sic_code, sic_desc) pairs widened to show presence by column.}
  #'   \item{presence_tabs}{Tabulations of presence (TRUE/FALSE/NA) by source column.}
  #'   \item{map_tbl}{Unified unique mapping of non-NA (sic_code, sic_desc) pairs.}
  #'   \item{n_unique_pairs}{Number of unique non-NA pairs in `map_tbl`.}
  #'   \item{code_to_desc_consistent}{TRUE if each sic_code maps to exactly one sic_desc.}
  #'   \item{desc_to_code_consistent}{TRUE if each sic_desc maps to exactly one sic_code.}
  #'   \item{codes_with_multiple_desc}{SIC codes mapping to != 1 description (if any).}
  #'   \item{desc_with_multiple_codes}{Descriptions mapping to != 1 code (if any).}
  #'   \item{codes_with_mixed_na_desc}{Non-NA mappings for codes that are NA somewhere else.}
  #' }
  #'
  #' @export
  
  # ---- Basic input validation ----
  stopifnot(
    is.data.frame(df),
    is.character(sic_code_cols),
    is.character(sic_desc_cols),
    length(sic_code_cols) == length(sic_desc_cols)
  )
  
  # ---- Align code/desc pairs to columns that actually exist in df ----
  # Keep only those indices where BOTH the code col and the paired desc col exist.
  code_present <- intersect(sic_code_cols, names(df))
  desc_present <- intersect(sic_desc_cols, names(df))
  
  idx_code <- match(code_present, sic_code_cols)
  idx_desc <- match(desc_present, sic_desc_cols)
  keep_idx <- intersect(idx_code, idx_desc)
  
  sic_code_cols <- sic_code_cols[keep_idx]
  sic_desc_cols <- sic_desc_cols[keep_idx]
  
  # Keep only ID columns that exist.
  id_cols <- intersect(id_cols, names(df))
  
  # ---- Overflow checks (requires overflow_check() in your environment) ----
  overflow_checked_code <- overflow_check(df, sic_code_cols)
  overflow_checked_desc <- overflow_check(df, sic_desc_cols)
  
  # Summaries: how many rows follow the expected pattern vs not.
  overflow_summary_code <- overflow_checked_code %>%
    dplyr::summarise(
      n_rows = dplyr::n(),
      n_ok   = sum(overflow_ok, na.rm = TRUE),
      n_bad  = sum(!overflow_ok, na.rm = TRUE)
    )
  
  overflow_summary_desc <- overflow_checked_desc %>%
    dplyr::summarise(
      n_rows = dplyr::n(),
      n_ok   = sum(overflow_ok, na.rm = TRUE),
      n_bad  = sum(!overflow_ok, na.rm = TRUE)
    )
  
  # Rows where the *description* overflow pattern fails
  # (often because a code exists but its paired description is missing).
  missing_desc <- overflow_checked_desc %>%
    dplyr::filter(overflow_ok == FALSE) %>%
    dplyr::select(dplyr::any_of(c(id_cols, sic_code_cols, sic_desc_cols)))
  
  # ---- Presence/overlap across sources (code+desc pairs) ----
  # Each "source" is a column pairing such as "sic_code_1 + sic6_descriptions_sic1"
  all_sources <- paste0(sic_code_cols, " + ", sic_desc_cols)
  
  # Long table of unique pairs per source (keeping NA on either side so we can see gaps).
  long <- purrr::map2_dfr(sic_code_cols, sic_desc_cols, \(cc, dc) {
    df %>%
      dplyr::transmute(
        sic_code = as.character(.data[[cc]]),
        sic_desc = as.character(.data[[dc]])
      ) %>%
      dplyr::filter(!is.na(sic_code) | !is.na(sic_desc)) %>%
      dplyr::distinct() %>%
      dplyr::mutate(
        source  = paste0(cc, " + ", dc),
        present = TRUE
      )
  })
  
  # Widen to one row per unique (sic_code, sic_desc), with TRUE/FALSE presence by source.
  presence_wide <- long %>%
    dplyr::mutate(source = factor(source, levels = all_sources)) %>%
    tidyr::pivot_wider(
      names_from  = source,
      values_from = present,
      values_fill = FALSE
    )
  
  # Tabulate presence by the *source columns that actually exist* in presence_wide.
  # (Some expected sources may not appear if there were zero observations.)
  source_cols_present <- intersect(all_sources, names(presence_wide))
  presence_tabs <- lapply(source_cols_present, \(nm) table(presence_wide[[nm]], useNA = "ifany"))
  presence_tabs <- do.call(rbind, presence_tabs)
  rownames(presence_tabs) <- source_cols_present
  
  # ---- SIC codes that are missing descriptions (robust to any number of pairs) ----
  # Build all (sic_code, sic_desc) pairs from the failing rows and capture codes with NA desc.
  missing_desc_code <- character(0)
  if (nrow(missing_desc) > 0 && length(sic_code_cols) > 0) {
    
    miss_long <- purrr::map2_dfr(sic_code_cols, sic_desc_cols, \(cc, dc) {
      missing_desc %>%
        dplyr::transmute(
          sic_code = as.character(.data[[cc]]),
          sic_desc = as.character(.data[[dc]])
        )
    })
    
    missing_desc_code <- miss_long %>%
      dplyr::filter(is.na(sic_desc), !is.na(sic_code)) %>%
      dplyr::distinct(sic_code) %>%
      dplyr::pull(sic_code)
  }
  
  # All SIC codes paired with NA descriptions anywhere in the presence table.
  no_desc <- presence_wide %>%
    dplyr::filter(is.na(sic_desc)) %>%
    dplyr::pull(sic_code)
  
  # Sanity check: codes flagged via missing_desc should be a subset of the no-desc codes.
  missing_desc_codes_are_subset_of_no_desc <- all(missing_desc_code %in% no_desc)
  
  # ---- Build a unified mapping table and run consistency checks ----
  # Only non-NA pairs go into the mapping table.
  map_tbl <- purrr::map2_dfr(sic_code_cols, sic_desc_cols, \(cc, dc) {
    df %>%
      dplyr::transmute(
        sic_code = as.character(.data[[cc]]),
        sic_desc = as.character(.data[[dc]])
      ) %>%
      dplyr::filter(!is.na(sic_code), !is.na(sic_desc)) %>%
      dplyr::distinct()
  }) %>%
    dplyr::distinct()
  
  # a) Each sic_code should map to exactly one sic_desc
  code_to_desc <- map_tbl %>%
    dplyr::count(sic_code, name = "n_desc") %>%
    dplyr::filter(n_desc != 1)
  
  # b) Each sic_desc should map to exactly one sic_code
  desc_to_code <- map_tbl %>%
    dplyr::count(sic_desc, name = "n_code") %>%
    dplyr::filter(n_code != 1)
  
  # c) SIC codes that appear with NA descriptions in some source and non-NA in another.
  desc_that_is_sometimes_na <- map_tbl %>%
    dplyr::filter(sic_code %in% no_desc)
  
  # ---- Return a single list output (all artifacts + checks) ----
  list(
    sic_code_cols_used = sic_code_cols,
    sic_desc_cols_used = sic_desc_cols,
    
    overflow_checked_code = overflow_checked_code,
    overflow_checked_desc = overflow_checked_desc,
    overflow_summary_code = overflow_summary_code,
    overflow_summary_desc = overflow_summary_desc,
    
    missing_desc_rows = missing_desc,
    missing_desc_code = missing_desc_code,
    missing_desc_codes_are_subset_of_no_desc = missing_desc_codes_are_subset_of_no_desc,
    
    presence_wide = presence_wide,
    presence_tabs = presence_tabs,
    
    map_tbl = map_tbl,
    n_unique_pairs = nrow(map_tbl),
    
    code_to_desc_consistent = (nrow(code_to_desc) == 0),
    desc_to_code_consistent = (nrow(desc_to_code) == 0),
    codes_with_multiple_desc = code_to_desc,
    desc_with_multiple_codes = desc_to_code,
    
    desc_that_is_sometimes_na = desc_that_is_sometimes_na
  )
}

ASSUMPTION A: Each additional SIC encoding is assumed to represent supplementary classification information attributed to that address. The code columns follow the expected pattern; however, the description columns do not. This is likely due to missing descriptions for certain codes. The sic_results$missing_desc_codes_are_subset_of_no_desc = TRUE result confirms that description fields deviating from this pattern do so because of missing descriptions for a given code.

sic_results$overflow_summary_code
#sic_results$overflow_checked_code
# A tibble: 1 × 3
    n_rows     n_ok n_bad
     <int>    <int> <int>
1 10201197 10201197     0
sic_results$overflow_summary_desc
#sic_results$overflow_checked_desc
# A tibble: 1 × 3
    n_rows     n_ok n_bad
     <int>    <int> <int>
1 10201197 10201098    99
# Confirm that description deviations from the expected overflow pattern
# are associated with codes that have no corresponding description.

#sic_results$missing_desc_rows
#sic_results$missing_desc_code
sic_results$missing_desc_codes_are_subset_of_no_desc

ASSUMPTION B: If each column represents an overflow of the same underlying information, a significant degree of overlap in the values present across columns would be expected.

# All SIC codes/descriptions
sic_results$presence_wide

# Tabulated outcomes
sic_results$presence_tabs
                                   row FALSE TRUE
1 primary_sic_code + sic6_descriptions  2563   43
2     sic_code + sic6_descriptions_sic   353 2253
3  sic_code_1 + sic6_descriptions_sic1  1149 1457
4  sic_code_2 + sic6_descriptions_sic2  1652  954
5  sic_code_3 + sic6_descriptions_sic3  1984  622

There is a high degree of overlap across the overflow SIC encodings, which decreases as the overflow index increases. This is expected, as fewer entries carry more than one or two additional SIC classifications.

Only a small subset of possible SIC codes is utilized in the primary column, suggesting that the available selection may be more restricted at the point of data collection. All 43 outcomes in the primary SIC are also used in overflow SIC columns.

# Pull columns that are logical TRUE/FALSE “sic presence” columns
sic_cols <- sic_results$presence_wide |>
  select(where(is.logical)) |>
  names()

other_cols <- setdiff(sic_cols, "primary_sic_code + sic6_descriptions")

# All Primary SIC outcomes are present in the overflow categories
sic_results$presence_wide %>%
  filter(.data[["primary_sic_code + sic6_descriptions"]] & if_any(all_of(other_cols), ~ .x)) %>%
  nrow()

ASSUMPTION C: It is assumed that each code maps uniquely to a single description and vice versa. To verify this assumption, a mapping table is constructed from all non-null (code, description) pairs across all column pairs. Consistency checks:

  1. Each sic_code maps to exactly one description.

  2. Each description maps to exactly one sic_code.

  3. sic_codes associated with an NA description are identified and accounted for.

sic_results$map_tbl

list(
  n_unique_pairs = sic_results$n_unique_pairs,
  code_to_desc_consistent = sic_results$code_to_desc_consistent,
  desc_to_code_consistent = sic_results$desc_to_code_consistent,
  codes_with_multiple_desc = sic_results$codes_with_multiple_desc,
  desc_with_multiple_codes = sic_results$desc_with_multiple_codes,
  desc_that_is_sometimes_na = sic_results$desc_that_is_sometimes_na
)
$n_unique_pairs
[1] 2583

$code_to_desc_consistent
[1] FALSE

$desc_to_code_consistent
[1] FALSE

$codes_with_multiple_desc
# A tibble: 18 × 2
   sic_code n_desc
   <chr>     <int>
 1 279610        2
 2 513705        2
 3 549903        2
 4 561101        2
 5 562101        2
 6 596316        2
 7 799922        2
 8 799927        2
 9 804908        2
10 822298        2
11 824401        2
12 824903        2
13 835102        2
14 864102        2
15 869904        2
16 874881        2
17 912101        2
18 912104        2

$desc_with_multiple_codes
# A tibble: 2 × 2
  sic_desc                       n_code
  <chr>                           <int>
1 COMPUTER SOFTWARE                   2
2 SCHOOLS-BUSINESS & SECRETARIAL      2

$desc_that_is_sometimes_na
# A tibble: 3 × 2
  sic_code sic_desc         
  <chr>    <chr>            
1 866124   MONASTERIES      
2 821120   SCHOOL DISTRICTS 
3 737206   COMPUTER SOFTWARE

Some codes are associated with multiple descriptions and vice versa. Additionally, some codes are sometimes associated with an NA description despite having a valid description in other instances. Each case will be addressed individually.

ASSUMPTION D: Many of these inconsistencies are attributable to variations in nomenclature and will be corrected on an individual basis.

# Entries to correct description
fix_desc <- sic_results$presence_wide %>%
  filter(sic_code %in% sic_results$codes_with_multiple_desc$sic_code) %>%
  select(sic_code, sic_desc) %>%
  arrange(sic_code) %>%
  group_by(sic_code) %>%
  mutate(dup_id = row_number()) %>%
  ungroup() %>%
  pivot_wider(
    id_cols = sic_code,
    names_from = dup_id,
    values_from = sic_desc,
    names_prefix = "sic_desc_"
  ) %>%
  as.data.frame()
   sic_code                                 sic_desc
1    279610                 PRINTERS SERVICES (MFRS)
2    279610                PRINTERS' SERVICES (MFRS)
3    513705                WOMEN'S APPAREL-WHOLESALE
4    513705                 WOMENS APPAREL-WHOLESALE
5    549903       WATER COMPANIES-BOTTLED, BULK, ETC
6    549903       WATER COMPANIES-BOTTLED/BULK & ETC
7    561101      MEN'S CLOTHING & FURNISHINGS-RETAIL
8    561101       MENS CLOTHING & FURNISHINGS-RETAIL
9    562101                   WOMEN'S APPAREL-RETAIL
10   562101                    WOMENS APPAREL-RETAIL
11   596316                 COFFEE & FOOD SVC-MOBILE
12   596316                       MOBILE CONCESSIONS
13   799922                        NATIONAL LANDMARK
14   799922 NATIONAL LANDMARK/HISTORIC SITE/MEMORIAL
15   799927 CHILDREN'S SVCS & ACTIVITIES INFORMATION
16   799927  CHILDRENS SVCS & ACTIVITIES INFORMATION
17   804908               NURSES & NURSES REGISTRIES
18   804908              NURSES & NURSES' REGISTRIES
19   822298   JUNIOR COLLEGES & TECHNICAL INSTITUTES
20   822298 JUNIOR-COMMUNITY COLLEGE-TECH INSTITUTES
21   824401           SCHOOLS-BUSINESS & SECRETARIAL
22   824401            SCHOOLS-BUSINESS & VOCATIONAL
23   824903     SCHOOLS-INDUSTRIAL TECHNICAL & TRADE
24   824903               SCHOOLS-TRADE & VOCATIONAL
25   835102  SCHOOLS-NURSERY & KINDERGARTEN ACADEMIC
26   835102 SCHOOLS-PRE-SCHOOL/KINDERGARTEN-ACADEMIC
27   864102        VETERANS & MILITARY ORGANIZATIONS
28   864102       VETERANS' & MILITARY ORGANIZATIONS
29   869904         WOMEN'S ORGANIZATIONS & SERVICES
30   869904          WOMENS ORGANIZATIONS & SERVICES
31   874881                  EDUCATIONAL ASSESSEMENT
32   874881                   EDUCATIONAL ASSESSMENT
33   912101               GOVERNMENT OFFICES-FEDERAL
34   912101                    GOVERNMENT OFFICES-US
35   912104   GOVERNMENT OFFICES-CITY, VILLAGE & TWP
36   912104    GOVERNMENT OFFICES-CITY/VILLAGE & TWP
# Entries to correct code
fix_code <- sic_results$presence_wide %>%
  semi_join(
    sic_results$presence_wide %>%
      filter(sic_desc %in% sic_results$desc_with_multiple_codes$sic_desc) %>%
      distinct(sic_code),
    by = "sic_code"
  )
# A tibble: 6 × 7
  sic_code sic_desc                primary_sic_code + s…¹ sic_code + sic6_desc…²
  <chr>    <chr>                   <lgl>                  <lgl>                 
1 573401   COMPUTER SOFTWARE       FALSE                  TRUE                  
2 737206   <NA>                    FALSE                  TRUE                  
3 737206   COMPUTER SOFTWARE       FALSE                  TRUE                  
4 824401   SCHOOLS-BUSINESS & SEC… FALSE                  TRUE                  
5 824401   SCHOOLS-BUSINESS & VOC… FALSE                  TRUE                  
6 824402   SCHOOLS-BUSINESS & SEC… FALSE                  TRUE                  
# ℹ abbreviated names: ¹​`primary_sic_code + sic6_descriptions`,
#   ²​`sic_code + sic6_descriptions_sic`
# ℹ 3 more variables: `sic_code_1 + sic6_descriptions_sic1` <lgl>,
#   `sic_code_2 + sic6_descriptions_sic2` <lgl>,
#   `sic_code_3 + sic6_descriptions_sic3` <lgl>
# Entries to correct NA
fix_na <- sic_results$presence_wide %>%
  semi_join(
    sic_results$presence_wide %>%
      filter(sic_desc %in% sic_results$desc_that_is_sometimes_na$sic_desc) %>%
      distinct(sic_code),
    by = "sic_code"
  )
# A tibble: 7 × 7
  sic_code sic_desc          primary_sic_code + sic6_de…¹ sic_code + sic6_desc…²
  <chr>    <chr>             <lgl>                        <lgl>                 
1 573401   COMPUTER SOFTWARE FALSE                        TRUE                  
2 737206   <NA>              FALSE                        TRUE                  
3 737206   COMPUTER SOFTWARE FALSE                        TRUE                  
4 821120   <NA>              FALSE                        TRUE                  
5 821120   SCHOOL DISTRICTS  FALSE                        TRUE                  
6 866124   <NA>              TRUE                         TRUE                  
7 866124   MONASTERIES       TRUE                         TRUE                  
# ℹ abbreviated names: ¹​`primary_sic_code + sic6_descriptions`,
#   ²​`sic_code + sic6_descriptions_sic`
# ℹ 3 more variables: `sic_code_1 + sic6_descriptions_sic1` <lgl>,
#   `sic_code_2 + sic6_descriptions_sic2` <lgl>,
#   `sic_code_3 + sic6_descriptions_sic3` <lgl>

These SIC code characteristics are expected to vary with additional years of data or subsequent exports of previous reports, as Data Axle may update their database over time. It will therefore be important to process these columns from a raw data export prior to beginning any analysis or data validation.

All unique SIC values for a given ABI and address will be retained and expanded as overflow columns. The primary encoding will be constrained to ensure it remains consistent with qualifying classifiers, while all additional SIC columns will represent any other unique SIC values present.

Handling Additional Metadata

In the previous section, two encodings were examined: NAICS and SIC. NAICS remained uninformative, while SIC provided additional important outlet details (described above). The remaining metadata fall into two handling categories:

  1. Metadata not directly relevant to the current project. These fields will be retained with no effort made to clean or validate the values.

  2. Metadata supplying additional location details, such as county code, census block, and FIPS codes. These fields will be assessed for validation in subsequent steps while being handled similarly to category (a). Note that attributed census block and related fields may not translate directly to specific decennial maps, potentially necessitating reference to an external database.

As described above, metadata can vary by location and across the years a business was recorded. As no cleaning or validation effort is being applied to category (a), both location and year-level variation will be retained. The same approach will apply to category (b), with the addition of Boolean validation checks to capture:

  • Whether the same location outcome spans the complete expected date range.

  • Whether multiple location outcomes are associated with the same address.

  • Whether multiple identified location outcomes vary by decennial period or follow alternative patterns (source)

Show the code
# ---- Functions to Construct the Unique Outcomes with Dates String ----

summarize_code_ranges <- function(df, code_col, abi_col = "abi", address_col = "combined_address", year_col = "archive_version_year") {
  #' @description
  #' Summarize contiguous year ranges for a code by ABI + address (data.table)
  #'
  #' For each ABI/address pair, this creates a single string describing each
  #' distinct `code_col` value and the contiguous year ranges in which it appears,
  #' e.g. `"A (2018-2020), B (2022)"`.
  #'
  #' Rows are deduplicated at the ABI/address/year/code level before computing
  #' runs. ABI/address pairs with no non-missing `code_col`+`year_col` observations
  #' are retained and return `NA` in the output column.
  #'
  #' @param df A data.frame/data.table containing ABI, address, code, and year fields.
  #' @param code_col Character scalar. Column name of the code/value to summarize.
  #' @param abi_col Character scalar. Column name for ABI. Default `"abi"`.
  #' @param address_col Character scalar. Column name for address. Default `"combined_address"`.
  #' @param year_col Character scalar. Column name for the year field. Default `"archive_version_year"`.
  #'
  #' @return A data.frame with columns `{abi_col}`, `{address_col}`, and
  #'   `{paste0(code_col, "_ranges")}`.
  #' @export
  
  stopifnot(is.character(code_col), length(code_col) == 1)
  out_col <- paste0(code_col, "_ranges")
  
  # Convert to data.table (no copy if df already is one)
  DT <- data.table::as.data.table(df)
  
  # Build the ABI/address key set we always want to keep in the output
  keys <- unique(
    DT[!is.na(get(abi_col)) & !is.na(get(address_col)),
       .(abi = get(abi_col), address = get(address_col))]
  )
  
  # ---- Overall "available" contiguous year ranges per ABI/address (ignores code_col) ----
  Y <- DT[
    !is.na(get(abi_col)) &
      !is.na(get(address_col)) &
      !is.na(get(year_col)),
    .(
      abi     = get(abi_col),
      address = get(address_col),
      .year   = suppressWarnings(as.integer(get(year_col)))
    )
  ][!is.na(.year)]  # drop rows where year can't be parsed
  
  if (nrow(Y) == 0L) {
    # No valid years anywhere: everyone gets NA for availability
    avail <- keys[, .(available_year_ranges = NA_character_)]
  } else {
    # Sort + de-duplicate years within ABI/address
    data.table::setorder(Y, abi, address, .year)
    Y <- unique(Y, by = c("abi", "address", ".year"))
    
    # Mark runs of consecutive years (new run when year != previous + 1)
    Y[, .run := cumsum(.year != data.table::shift(.year, type = "lag", fill = .year[1L]) + 1L),
      by = .(abi, address)]
    
    # Summarize each run to start/end
    avail_runs <- Y[, .(start_year = min(.year), end_year = max(.year)),
                    by = .(abi, address, .run)]
    
    # Format run as "YYYY" or "YYYY-YYYY"
    avail_runs[, date_range := data.table::fifelse(
      start_year == end_year,
      as.character(start_year),
      paste0(start_year, "-", end_year)
    )]
    
    # Collapse all runs to one string per ABI/address
    data.table::setorder(avail_runs, abi, address, start_year)
    avail <- avail_runs[, .(available_year_ranges = paste(date_range, collapse = ", ")),
                        by = .(abi, address)]
  }
  
  # ---- Code-specific contiguous year ranges per ABI/address/code ----
  X <- DT[
    !is.na(get(abi_col)) &
      !is.na(get(address_col)) &
      !is.na(get(code_col)) &
      !is.na(get(year_col)),
    .(
      abi     = get(abi_col),
      address = get(address_col),
      .year   = suppressWarnings(as.integer(get(year_col))),
      .code   = as.character(get(code_col))
    )
  ][!is.na(.year)]
  
  if (nrow(X) == 0L) {
    # No valid code+year rows: return keys + availability + NA code ranges
    out <- merge(keys, avail, by = c("abi", "address"), all.x = TRUE, sort = FALSE)
    out[, (out_col) := NA_character_]
  } else {
    # Sort + de-duplicate at ABI/address/year/code
    data.table::setorder(X, abi, address, .code, .year)
    X <- unique(X, by = c("abi", "address", ".year", ".code"))
    
    # Mark runs of consecutive years within each ABI/address/code
    X[, .run := cumsum(.year != data.table::shift(.year, type = "lag", fill = .year[1L]) + 1L),
      by = .(abi, address, .code)]
    
    # Summarize each run to start/end
    runs <- X[, .(start_year = min(.year), end_year = max(.year)),
              by = .(abi, address, .code, .run)]
    
    # Format run as "YYYY" or "YYYY-YYYY"
    runs[, date_range := data.table::fifelse(
      start_year == end_year,
      as.character(start_year),
      paste0(start_year, "-", end_year)
    )]
    
    # Create "CODE (range)" items and collapse per ABI/address
    data.table::setorder(runs, abi, address, start_year, .code)
    runs[, .item := paste0(.code, " (", date_range, ")")]
    
    res <- runs[, .(val = paste(.item, collapse = ", ")), by = .(abi, address)]
    data.table::setnames(res, "val", out_col)
    
    # Merge everything back to full key set
    out <- merge(keys, avail, by = c("abi", "address"), all.x = TRUE, sort = FALSE)
    out <- merge(out,  res,  by = c("abi", "address"), all.x = TRUE, sort = FALSE)
    
    # Normalize empty strings to NA
    out[, (out_col) := data.table::fifelse(get(out_col) == "", NA_character_, get(out_col))]
  }
  
  # Restore original column names
  data.table::setnames(out, c("abi", "address"), c(abi_col, address_col))
  
  # Put available_year_ranges right after combined_address (address_col)
  data.table::setcolorder(
    out,
    c(
      abi_col, address_col, "available_year_ranges",
      setdiff(names(out), c(abi_col, address_col, "available_year_ranges"))
    )
  )
  
  as.data.frame(out)
}




summarize_many_code_ranges_dt <- function(df, vars, abi_col = "abi", address_col = "combined_address", year_col = "archive_version_year") {
  #' @description
  #' Summarize contiguous year ranges for many code columns (with progress bar)
  #'
  #' Calls `summarize_code_ranges()` for each variable in `vars` and merges results by ABI/address.
  #' A console progress bar is displayed using `utils::txtProgressBar()`.
  #'
  #' @param df A data.frame/data.table containing ABI, address, and year fields.
  #' @param vars Character vector of column names to summarize (each becomes `*_ranges`).
  #' @param abi_col Character scalar. Column name for ABI. Default `"abi"`.
  #' @param address_col Character scalar. Column name for address. Default `"combined_address"`.
  #' @param year_col Character scalar. Column name for the year field. Default `"archive_version_year"`.
  #'
  #' @return A data.frame with one row per ABI/address and one `*_ranges` column per
  #'   element of `vars`.
  #' @export
  
  stopifnot(is.character(vars), length(vars) >= 1L)
  
  DT <- as.data.table(df)
  
  # Compute the base output ONCE (includes available_year_ranges)
  base <- summarize_code_ranges(
    df,
    code_col = vars[[1]],
    abi_col = abi_col,
    address_col = address_col,
    year_col = year_col
  )
  
  # Convert base to data.table and standardize key names for merging
  out <- as.data.table(base)
  setnames(out, c(abi_col, address_col), c("abi", "address"))
  
  pb <- utils::txtProgressBar(min = 0, max = length(vars), style = 3)
  on.exit(close(pb), add = TRUE)
  
  for (i in seq_along(vars)) {
    v <- vars[[i]]
    newcol <- paste0(v, "_ranges")
    
    tmp <- summarize_code_ranges(
      df,
      code_col = v,
      abi_col = abi_col,
      address_col = address_col,
      year_col = year_col
    )
    
    tmpDT <- as.data.table(tmp)
    setnames(tmpDT, c(abi_col, address_col), c("abi", "address"))
    
    # Drop availability column(s) from tmp (we keep the one from base)
    tmpDT[, grep("^available_year_ranges(\\.|$)", names(tmpDT), value = TRUE) := NULL]
    
    # If out already has this *_ranges column, don't merge a second copy
    if (newcol %in% names(out)) {
      tmpDT[, (newcol) := NULL]
    }
    
    # Merge once (your original code merged twice, which guarantees duplicates)
    out <- merge(out, tmpDT, by = c("abi", "address"), all.x = TRUE, sort = FALSE)
    
    utils::setTxtProgressBar(pb, i)
  }
  
  # Restore original ABI/address column names
  setnames(out, c("abi", "address"), c(abi_col, address_col))
  
  # Ensure available_year_ranges is right after the address column
  setcolorder(
    out,
    c(
      abi_col, address_col, "available_year_ranges",
      setdiff(names(out), c(abi_col, address_col, "available_year_ranges"))
    )
  )
  
  as.data.frame(out)
}




# ---- Functions to Quality Check Compressed Location Information ----

avail_years <- function(x) {
  #' @description
  #' Expand an "available years" string into a sorted unique vector of years.
  #' Parses strings like `"2002-2005, 2007, 2010-2012"` into an integer vector.
  #' Used by [coverage_checks()] (and therefore indirectly by [check_ranges_same_outcome()]).
  #'
  #' @param x Character scalar. Comma-separated list of years or year ranges in the form
  #'   `"YYYY"` or `"YYYY-YYYY"`. Whitespace is allowed.
  #'
  #' @return Integer vector of years (may be length 0 if `x` is empty/NA/unparseable).
  
  # Treat NA/blank as no years
  if (is.na(x) || !nzchar(trimws(x))) return(integer(0))
  
  # Split into tokens on commas (allow whitespace around commas)
  parts <- stringr::str_split(x, "\\s*,\\s*")[[1]]
  
  # Convert each token to years
  yrs <- purrr::map(parts, function(p) {
    # Match YYYY or YYYY-YYYY (with optional whitespace)
    m <- stringr::str_match(p, "^\\s*(\\d{4})(?:\\s*-\\s*(\\d{4}))?\\s*$")
    
    # Ignore unparseable token
    if (anyNA(m)) return(integer(0))
    
    # Start year
    s <- as.integer(m[2])
    
    # End year (or start if missing)
    e <- as.integer(dplyr::if_else(is.na(m[3]), m[2], m[3]))
    
    # Expand inclusive range
    seq.int(s, e)
  })
  
  # Flatten, unique, sort
  sort(unique(unlist(yrs)))
}




normalize_year <- function(y) {
  #' @description
  #' Normalize a year token into a 4-digit integer year and keeps digits only. If 
  #' more than 4 digits are present, uses the *last 4* digits (e.g., `"20011"` -> 
  #' `2011`). Used by [year_value_map()] (and therefore indirectly by
  #' [coverage_checks()] and [check_ranges_same_outcome()]).
  #'
  #' @param y A value coercible to character containing a year.
  #'
  #' @return Integer year (e.g., `2011`) or `NA_integer_` if no digits are found.
  
  # Keep digits only
  y <- gsub("\\D", "", as.character(y))
  
  # No digits => NA
  if (!nzchar(y)) return(NA_integer_)
  
  # If too long, keep last 4 digits (e.g., 20011 -> 2011)
  if (nchar(y) > 4) y <- substr(y, nchar(y) - 3, nchar(y))
  
  # Coerce to integer year
  as.integer(y)
}




year_value_map <- function(x) {
  #' @description
  #' Parse a ranges string into a year -> value map (and detect conflicting overlaps).
  #' Parses `"A (2002), B (2003-2005)"` into a mapping from each year to a value label.
  #' Overlapping years with different labels are returned as conflicts.
  #' Used by [coverage_checks()] (and therefore indirectly by [check_ranges_same_outcome()]).
  #'
  #' @param x Character scalar like `"0 (2002), 1 (2011-2016), 2 (2021-2022)"`.
  #'
  #' @return A list with:
  #' \itemize{
  #'   \item `map`: named character vector (names are years, values are labels).
  #'   \item `conflict_years`: integer vector of years that have >1 distinct label.
  #' }
  
  # Empty input => empty mapping
  if (is.na(x) || !nzchar(trimws(x))) {
    return(list(
      map = setNames(character(0), character(0)),
      conflict_years = integer(0)
    ))
  }
  
  # Extract "<val> (YYYY[-YYYY])" segments
  m <- stringr::str_match_all(
    x,
    "(?:^|,\\s*)([^,()]+?)\\s*\\(\\s*(\\d{4,})(?:\\s*-\\s*(\\d{4,}))?\\s*\\)"
  )[[1]]
  
  # No matches => empty mapping
  if (nrow(m) == 0) {
    return(list(
      map = setNames(character(0), character(0)),
      conflict_years = integer(0)
    ))
  }
  
  # Segment labels
  vals <- stringr::str_trim(m[, 2])
  
  # Start years (normalized)
  starts <- vapply(m[, 3], normalize_year, integer(1))
  
  # End years (or start if missing), normalized
  ends <- vapply(ifelse(is.na(m[, 4]), m[, 3], m[, 4]), normalize_year, integer(1))
  
  # Drop unparseable segments
  keep <- !is.na(starts) & !is.na(ends)
  vals <- vals[keep]
  starts <- starts[keep]
  ends <- ends[keep]
  
  # All segments dropped => empty mapping
  if (length(vals) == 0) {
    return(list(
      map = setNames(character(0), character(0)),
      conflict_years = integer(0)
    ))
  }
  
  # Expand to one row per year
  long <- purrr::map2_dfr(seq_along(vals), vals, function(i, v) {
    # Ignore inverted ranges like "2016-2011"
    if (ends[i] < starts[i]) {
      return(tibble::tibble(year = integer(0), value = character(0)))
    }
    
    # Inclusive expansion
    tibble::tibble(year = seq.int(starts[i], ends[i]), value = v)
  })
  
  # Nothing expanded => empty mapping
  if (nrow(long) == 0) {
    return(list(
      map = setNames(character(0), character(0)),
      conflict_years = integer(0)
    ))
  }
  
  # Years with >1 distinct label (overlaps)
  conflict_years <- long |>
    dplyr::group_by(year) |>
    dplyr::summarise(nv = dplyr::n_distinct(value), .groups = "drop") |>
    dplyr::filter(nv > 1) |>
    dplyr::pull(year)
  
  # First label per year (stable map for downstream checks)
  map_first <- long |>
    dplyr::group_by(year) |>
    dplyr::summarise(value = dplyr::first(value), .groups = "drop")
  
  # Named vector year -> value + conflicts
  list(
    map = setNames(map_first$value, as.character(map_first$year)),
    conflict_years = conflict_years
  )
}




compress_years <- function(years) {
  #' @description
  #' Compress a set of years into "start:end" runs.
  #' `c(2003,2004,2005,2010,2012,2013)` -> `"2003:2005, 2010, 2012:2013"`.
  #' Used by [coverage_checks()] (and therefore indirectly by [check_ranges_same_outcome()]).
  #'
  #' @param years Integer (or coercible) vector of years.
  #'
  #' @return Character scalar. Empty string if `years` is empty.
  
  # Normalize + sort + unique
  years <- sort(unique(as.integer(years)))
  
  # Drop NA
  years <- years[!is.na(years)]
  
  # Nothing to format
  if (length(years) == 0) return("")
  
  # New run when gap != 1
  breaks <- c(TRUE, diff(years) != 1L)
  
  # Run id per year
  grp <- cumsum(breaks)
  
  # Convert each run to "start:end" (or single year)
  parts <- vapply(split(years, grp), function(v) {
    # Single year
    if (length(v) == 1) return(as.character(v[1]))
    
    # Consecutive run
    paste0(v[1], ":", v[length(v)])
  }, character(1))
  
  # Join runs
  paste(parts, collapse = ", ")
}




coverage_checks <- function(avail_str, ranges_str) {
  #' @description
  #' Compute coverage/uniqueness and "decade consistency" checks for one row.
  #' Used by [check_ranges_same_outcome()] to evaluate each `(available_year_ranges, *_ranges)`
  #' pair row-by-row.
  #'
  #' Returns `ok` (single value across all available years) and `decade_ok` (only when `ok` is FALSE:
  #' within each decade, values are consistent; gaps are ignored).
  #'
  #' @param avail_str Character scalar, e.g. `"2002-2022"`.
  #' @param ranges_str Character scalar, e.g. `"0 (2002), 1 (2011-2016), 2 (2021-2022)"`.
  #'
  #' @return Named list: `ok`, `decade_ok`, `missing_years`, `values_seen`, `conflict_years`.
  
  # Expand availability to explicit years
  yrs <- avail_years(avail_str)
  
  # Can't check without available years
  if (length(yrs) == 0) {
    return(list(
      ok = NA,
      decade_ok = NA,
      missing_years = NA_character_,
      values_seen = NA_character_,
      conflict_years = NA_character_
    ))
  }
  
  # Parse mapping + conflicts
  parsed <- year_value_map(ranges_str)
  
  # Named vector year -> value
  m <- parsed$map
  
  # Years as character for matching names()
  avail_chr <- as.character(yrs)
  
  # Missing available years
  missing <- setdiff(avail_chr, names(m))
  
  # Available years that are present in the mapping
  present_years <- intersect(avail_chr, names(m))
  
  # Values for those present available years
  present_vals <- unname(m[present_years])
  
  # Drop empty values defensively
  present_vals <- present_vals[nzchar(present_vals)]
  
  # Distinct values observed across available years
  uniq_vals_all <- unique(present_vals)
  
  # Overall pass/fail: full coverage, no conflicts, and exactly one unique value
  ok <- (length(missing) == 0) &&
    (length(parsed$conflict_years) == 0) &&
    (length(present_vals) == length(avail_chr)) &&
    (length(uniq_vals_all) == 1)
  
  # Decade check only when ok is FALSE
  if (isTRUE(ok)) {
    # Not needed when overall check passes
    decade_ok <- NA
  } else {
    # Fail if nothing is present at all
    if (length(present_years) == 0) {
      decade_ok <- FALSE
    } else {
      # Convert years to decade buckets (e.g., 2011 -> 2010)
      yy <- as.integer(present_years)
      d0 <- (yy %/% 10L) * 10L
      
      # Within each decade, require <= 1 distinct value (ignore gaps)
      decade_ok <- tibble::tibble(decade = d0, value = present_vals) |>
        dplyr::group_by(decade) |>
        dplyr::summarise(nv = dplyr::n_distinct(value), .groups = "drop") |>
        dplyr::summarise(all_one = all(nv <= 1L), .groups = "drop") |>
        dplyr::pull(all_one) |>
        dplyr::first(default = FALSE)
    }
  }
  
  # Return checks + formatted diagnostics
  list(
    ok = ok,
    decade_ok = decade_ok,
    missing_years = compress_years(missing),
    values_seen = if (length(uniq_vals_all) == 0) "" else paste(sort(uniq_vals_all), collapse = ", "),
    conflict_years = compress_years(parsed$conflict_years)
  )
}




check_ranges_same_outcome <- function(df,
                                      available_col = "available_year_ranges",
                                      ranges_suffix = "_ranges") {
  #' @description
  #' Check all `*_ranges` columns against an availability column.
  #' Main entry point. For every column ending in `ranges_suffix`, computes:
  #' `{col}_ok`, `{col}_decade_ok`, `{col}_missing_years`, `{col}_values_seen`, 
  #' `{col}_conflict_years`.
  #'
  #' Internally, it calls [coverage_checks()] for each row/column pair.
  #'
  #' @param df A data.frame/tibble containing an availability column and one or more `*_ranges` columns.
  #' @param available_col Name of the column containing available year ranges.
  #' @param ranges_suffix Suffix used to identify the range columns (default `"_ranges"`).
  #'
  #' @return `df` with additional result columns appended.
  
  stopifnot(available_col %in% names(df))
  
  # Range columns in ORIGINAL order (from df)
  range_cols <- names(df)[stringr::str_ends(names(df), stringr::fixed(ranges_suffix))]
  range_cols <- setdiff(range_cols, available_col)
  
  # Start with original df
  out <- df
  
  # Build outputs (without inserting yet) so we can reorder deterministically
  created_names <- character(0)
  
  for (col in range_cols) {
    base <- stringr::str_remove(col, stringr::fixed(ranges_suffix))
    res  <- purrr::map2(df[[available_col]], df[[col]], coverage_checks)
    
    nm_ok        <- paste0(base, "_ranges_ok")
    nm_decade_ok <- paste0(base, "_ranges_decade_ok")
    nm_values    <- paste0(base, "_values_seen")
    nm_missing   <- paste0(base, "_missing_years")
    nm_conflict  <- paste0(base, "_conflict_years")
    
    out[[nm_ok]]        <- purrr::map_lgl(res, "ok")
    out[[nm_decade_ok]] <- purrr::map(res, "decade_ok") |> unlist(use.names = FALSE)
    out[[nm_values]]    <- purrr::map_chr(res, "values_seen")
    out[[nm_missing]]   <- purrr::map_chr(res, "missing_years")
    out[[nm_conflict]]  <- purrr::map_chr(res, "conflict_years")
    
    created_names <- c(created_names, nm_ok, nm_decade_ok, nm_missing, nm_values, nm_conflict)
  }
  
  # Reorder columns so each block of derived columns sits immediately after its *_ranges column
  # Keep all non-created columns in their original relative order.
  orig <- names(df)
  is_created <- names(out) %in% created_names
  
  ordered <- character(0)
  for (nm in orig) {
    ordered <- c(ordered, nm)
    
    if (nm %in% range_cols) {
      base <- stringr::str_remove(nm, stringr::fixed(ranges_suffix))
      ordered <- c(
        ordered,
        paste0(base, "_ranges_ok"),
        paste0(base, "_ranges_decade_ok"),
        paste0(base, "_values_seen"),
        paste0(base, "_missing_years"),
        paste0(base, "_conflict_years")
      )
    }
  }
  
  # Append any remaining columns (e.g., if df had other columns added elsewhere)
  remaining <- setdiff(names(out), ordered)
  out <- out[, c(ordered, remaining), drop = FALSE]
  
  out
}

Supplementary Location Metadata

The following illustrates the process of condensing additional location metadata into one line per unique ABI and address combination. Although some of these fields are numeric, they will be coerced to character, as each represents a classification regardless of whether it is encoded as a numeric value. Only longitude and latitude will be retained and treated as numeric.

vars_for_loc <- c(
  # Core business location details (8)
  "census_block", "census_tract", "county_code",  "fips_code", "cbsa_level", 
  "cbsa_code", "csa_code", "area_code"
)

loc_condensed <- summarize_many_code_ranges_dt(subset, vars_for_loc) %>%
  arrange(abi)

#' @description Codebook for the output fields produced by the evaluation. For
#'              each variable condensed to unique abi and address combination,
#'              additional checks are added.
#' 
#' @field _ranges_OK Logical. TRUE when the `*_ranges` field (e.g., 
#'                   `census_tract_ranges`) assigns a value for every year 
#'                   listed in `available_year_ranges`, has no overlapping year 
#'                   conflicts (a year mapped to more than one distinct value), 
#'                   and all available years map to exactly one unique value 
#'                   overall. NA when `available_year_ranges` is missing or 
#'                   cannot be parsed.
#' 
#' @field _ranges_decade_ok Logical or NA. Only evaluated when `*_ranges_ok` 
#'                          is FALSE. TRUE when, within each decade that has 
#'                          any mapped available years, all mapped years in that
#'                          decade share the same value (i.e., ≤ 1 distinct 
#'                          value per decade). This check ignores gaps (missing 
#'                          years) within decades. NA when `*_ranges_ok` is TRUE 
#'                          or when `available_year_ranges` is missing/unparseable.
#' 
#' @field _values_seen Character. Comma-separated list of distinct values 
#'                     observed for available years in the `*_ranges` mapping 
#'                     (restricted to years in `available_year_ranges`). Empty 
#'                     string if none are observed.
#' 
#' @field _missing_years Character. Years present in `available_year_ranges` 
#'                       that are not covered by the `*_ranges` mapping. Empty 
#'                       string if none are missing. NA when 
#'                       `available_year_ranges` is missing/unparseable.
#' 
#' @field _conflict_years Character. Years that are assigned more than one 
#'                        distinct value due to overlapping segments in the 
#'                        `*_ranges` mapping, compressed as runs in the form 
#'                        `"YYYY:YYYY"` and separated by `", "`. Empty string 
#'                        if no conflicts.

area_code_qc <- check_ranges_same_outcome(loc_condensed)
  available_year_ranges
1             2003-2004
2             2005-2006
3             2007-2011
4             2012-2013
5  2007-2019, 2023-2025
6             2005-2025
7             2002-2022
8             2004-2025
                                         census_tract_ranges
1                                         012101 (2003-2004)
2                                         012102 (2005-2006)
3                                         012101 (2007-2011)
4                                         012101 (2012-2013)
5 275510 (2007-2016), 275500 (2017-2019), 275500 (2023-2025)
6                     013701 (2005-2023), 013703 (2024-2025)
7      000000 (2002), 001700 (2003-2016), 001701 (2017-2022)
8                                         004412 (2004-2025)
  census_tract_ranges_ok census_tract_ranges_decade_ok census_tract_values_seen
1                   TRUE                            NA                   012101
2                   TRUE                            NA                   012102
3                   TRUE                            NA                   012101
4                   TRUE                            NA                   012101
5                  FALSE                         FALSE           275500, 275510
6                  FALSE                         FALSE           013701, 013703
7                  FALSE                         FALSE   000000, 001700, 001701
8                   TRUE                            NA                   004412
  census_tract_missing_years census_tract_conflict_years
1                       <NA>                        <NA>
2                       <NA>                        <NA>
3                       <NA>                        <NA>
4                       <NA>                        <NA>
5                       <NA>                        <NA>
6                       <NA>                        <NA>
7                       <NA>                        <NA>
8                       <NA>                        <NA>

Looking at the results for the census_tract variable, nearly half of the entries failed the consistency check and did not pass the check confirming that updates were made over decennial periods, as would be expected. This may be correct or could indicate broader data accuracy problems; either way, it points to notable inconsistencies that would be challenging to reconcile given the available results.

This assessment will be incorporated when condensing the full dataset. However, these findings suggest that the additional location values may need to be freshly associated using a trusted database rather than relying on the provided information.

Metadata Not Contributive to Project Goals

The following illustrates the process of condensing additional other metadata into one line per unique ABI and address combination. Note that primary_naics2_code is not represented here, as it will be derived from primary_naics8_code in the main dataset. This field falls under the “Unknown Variables or Code Translations” category.

vars_to_sum <- c(
  # Core business details by business ID (4):
  "company", "year_established", "subsidiary_number", "company_holding_status",
  # Core business details by location and might change by year (7):
  "site_number", "yellow_page_code", "office_size_code", "employee_size_location",
  "location_employee_size_code", "sales_volume_location", "location_sales_volume_code",
  # Parent company details, where some might change by year (5):
  "parent_number", "parent_actual_employee_size", "parent_employee_size_code",
  "parent_actual_sales_volume", "parent_sales_volume_code",
  # Unknown variables or code translations (8):
  #primary_naics2_code
  "address_type_indicator", "industry_specific_first_byte", 
  "business_status_code", "population_code", "match_code", "ticker", "idcode"
)

subset_condensed <- summarize_many_code_ranges_dt(subset, vars_to_sum) %>%
  arrange(abi)
  available_year_ranges                          employee_size_location_ranges
1             2003-2004                                          4 (2003-2004)
2             2005-2006                                          4 (2005-2006)
3             2007-2011                                          3 (2007-2011)
4             2012-2013                                     3 (2012), 2 (2013)
5  2007-2019, 2023-2025                 3 (2007-2019), 3 (2023), 2 (2024-2025)
6             2005-2025                           9 (2005-2006), 8 (2007-2025)
7             2002-2022                 2 (2003), 4 (2004-2006), 1 (2007-2022)
8             2004-2025 10 (2004-2005), 9 (2006-2009), 3 (2010), 1 (2011-2025)
           location_employee_size_code_ranges  business_status_code_ranges
1                               A (2003-2004)                9 (2003-2004)
2                               A (2005-2006)                9 (2005-2006)
3                               A (2007-2011)                9 (2007-2011)
4                               A (2012-2013)                9 (2012-2013)
5                A (2007-2019), A (2023-2025) 9 (2007-2019), 9 (2023-2025)
6                               B (2005-2025)                9 (2005-2025)
7                               A (2002-2022)                9 (2003-2022)
8 C (2004-2005), B (2006-2009), A (2010-2025)                9 (2004-2025)
                       population_code_ranges
1                               6 (2003-2004)
2                               6 (2005-2006)
3                               6 (2007-2011)
4                               6 (2012-2013)
5 5 (2007-2015), 7 (2016-2019), 7 (2023-2025)
6                6 (2005-2015), 7 (2016-2025)
7                1 (2002-2015), 5 (2016-2022)
8                8 (2004-2015), 7 (2016-2025)

Concluding Thoughts

This document outlined additional insights and data processing requirements identified in the 2026 data format. A condensing framework is provided to collapse metadata fields that are not relevant to the project or are of questionable quality. This method retains unique values along with the years in which they were recorded, allowing analysts to preserve this information for follow-up data quality checks, validation, and cleaning.

Metadata Handling Summary

Rows will be condensed by unique abi and combined_address. NAICS codes will be split into the six leading digits and a trailing two-digit proprietary code, with the latter lacking a description. The six-digit section is consistent across all entries, while the trailing two digits will be condensed by unique value and observed date range.

SIC codes will be cleaned to ensure consistency and reconcile unexpected missing values or duplications. After condensing, the primary SIC codes will be organized first, followed by any additional overflow values detected across the years the ABI and address are recorded.

Business location will continue to be primarily attributed to the given address and longitude/latitude values. The assessment of the additional provided location fields presented here reveals unexpected inconsistencies that would be difficult or impractical to reconcile. For this reason, it is recommended that decennial location information be associated to each entry by geolocation, as outlined in the 2023 data form processing methods.

In addition to the above, data cleaning insights gleaned from the 2023 format are assumed to apply to this version as well: Review - 2023 Format.

Step 1 Algorithm Updates

The 2026 format begins in a different structure (long vs. wide), providing the opportunity to approach Step 1 of the outlined data processing differently: Step 1 Results - 2023 Format.

Prior to transforming the data from long form (years as rows) to wide form (years as columns), addresses will be assessed for similarity. Identified similar addresses will be applied at the time of metadata condensing to avoid a secondary condensing step on the wide form. Using the prior Step 1 method, duplications were mistakenly introduced and some addresses failed the geolocation quality check.

As a consequence, the following modifications will be implemented to account for these variations at the time of transformation:

  • Duplications were largely attributed to entries where the same address_line_1 details were associated with different cities and/or zip codes. Entries will be condensed by the compiled address instead of address_line_1 alone.

  • The widened version of the ABI subset will undergo column-wise evaluation to ensure each year is only represented once (sum to 1 or 0). Rows for a given ABI that fail this will be marked as duplicated while all else will be marked as passed. Entries associated with an ABI where no duplication was added will be marked as NA.

  • Before condensing, prospective similar addresses will undergo a longitude/latitude similarity test. Those that fail will not be condensed by similar address, but will instead be condensed by unique value. Addresses that fail this test but share the same address_line_1 value will be marked for override yet still condensed by unique value, allowing for follow-up once addresses have been validated using the USPS API in Step 2.

Back to top